Welcome to the third practical on text mining!
The aim of this practical is to enhance your understanding of sentiment analysis and to learn three different ways of performing it.

In this practical, we will focus on the following methods:

  • Dictionary-based methods (unsupervised).
  • TF-IDF (term frequency-inverse document frequency) based methods (supervised).
  • (OPTIONAL) Using deep learning and word-embeddings.

Preparation

In this practical, we make use of the following packages:

library(tm)
library(text2vec)
library(tidyverse)
library(tidytext)
library(ggplot2)

library(caret)
library(rpart)
library(rpart.plot)

Text data

We are going to use one data set, movie_review, in this practical:

  • IMDB movie reviews is a labeled data set available with the text2vec package. It consists of 5000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment label is binary: an IMDB rating < 5 gives a sentiment score of 0, and a rating >= 7 gives a sentiment score of 1. No individual movie has more than 30 reviews. Load this data set and convert it to a data frame.
# load an example dataset from text2vec
data("movie_review")
# convert to a tibble and keep the result
movie_review <- as_tibble(movie_review)
movie_review
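
To make the labeling rule concrete, here is a minimal sketch in base R. The ratings are made up for illustration and are not taken from movie_review. Ratings of 5 and 6 satisfy neither condition of the rule, so the sketch leaves them unlabeled (NA).

```r
# Hypothetical ratings on the 1-10 IMDB scale (not taken from movie_review)
ratings <- c(1, 4, 5, 6, 7, 10)

# rating < 5 -> 0 (negative), rating >= 7 -> 1 (positive), otherwise no label
label <- ifelse(ratings < 5, 0L, ifelse(ratings >= 7, 1L, NA_integer_))

data.frame(rating = ratings, sentiment = label)
```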

Dictionary-based

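The tidytext package provides access to four sentiment lexicons through the get_sentiments function (in recent versions, only bing ships with the package itself; the others are downloaded via the textdata package).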

  • AFINN: list of English words rated for valence between -5 and +5
  • bing: list of English words labeled as either positive or negative
  • nrc: list of English words and their associations with 8 emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and 2 sentiments (negative and positive); each association is binary (present or absent)
  • loughran: list of sentiment words for accounting and finance by category (Negative, Positive, Uncertainty, Litigious, Strong Modal, Weak Modal, Constraining)

1. We are going to use the bing lexicon in this practical. Using the get_sentiments function, load the “bing” dictionary and store it in an object called bing_sentiments.

bing_sentiments  <- get_sentiments("bing")
bing_sentiments
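
Before using the lexicon, it can help to see how it is organised. The sketch below counts the words per sentiment category and looks up a few example words. The three words are chosen here purely for illustration.

```r
library(dplyr)
library(tidytext)

bing_sentiments <- get_sentiments("bing")

# number of words in each sentiment category
count(bing_sentiments, sentiment)

# look up a few example words (chosen for illustration only)
filter(bing_sentiments, word %in% c("good", "bad", "boring"))
```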

2. Use the unnest_tokens function from the tidytext package to break the text into individual tokens (a process called tokenization), and use the head function to see the first several rows.

# tokenize the reviews
tidy_review <- movie_review %>% 
  unnest_tokens(word, review) %>% 
  # drop the original label so it does not clash with the lexicon's sentiment column in the join below
  select(-sentiment)

head(tidy_review)
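
To see what tokenization does, here is a small sketch on two made-up reviews (toy_reviews is a hypothetical object, not part of the practical's data). With its default settings, unnest_tokens lowercases the text and strips punctuation, producing one row per word.

```r
library(dplyr)
library(tidytext)

# Two made-up reviews for illustration
toy_reviews <- tibble(
  id     = c("r1", "r2"),
  review = c("A great movie!", "Boring, bad plot.")
)

# one row per word; the id column is repeated for every token of a review
toy_tokens <- toy_reviews %>%
  unnest_tokens(word, review)

toy_tokens
```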

3. Use the inner_join function to look up the sentiment of each tokenized review word in the Bing lexicon (i.e., bing_sentiments).

review_sentiment <- tidy_review %>%
  inner_join(bing_sentiments, by = "word")

head(review_sentiment)
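
An inner join keeps only the tokens that appear in the lexicon, so a review without any lexicon words disappears entirely. The sketch below shows this on made-up tokens and a made-up three-word lexicon (toy_tokens and toy_lexicon are hypothetical, not the Bing lexicon).

```r
library(dplyr)

# Made-up tokens from three reviews
toy_tokens <- tibble(
  id   = c("r1", "r1", "r2", "r2", "r3"),
  word = c("great", "movie", "boring", "bad", "plot")
)

# Made-up lexicon (stands in for bing_sentiments)
toy_lexicon <- tibble(
  word      = c("great", "boring", "bad"),
  sentiment = c("positive", "negative", "negative")
)

# only words present in both tables survive the join
toy_matched <- inner_join(toy_tokens, toy_lexicon, by = "word")
toy_matched

# reviews that lost all of their tokens in the join
dropped_ids <- setdiff(unique(toy_tokens$id), unique(toy_matched$id))
dropped_ids
```

Here review r3 has no word in the lexicon, so it no longer appears in toy_matched.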

4. Count how many positive and negative words there are in each review (i.e., per id). Then compute the net sentiment score by subtracting the number of negative words from the number of positive words.

Hint: You can use the count function from the dplyr package.

review_sentiment <- review_sentiment %>% 
  # count the sentiment (positive/negative) per id
  count(id, sentiment) %>% 
  # wide-format
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  # compute the sentiment score
  mutate(sentiment_score = positive - negative)
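
The same count, reshape, and subtract steps can be traced by hand on a tiny made-up example (toy_matched is hypothetical). Note the role of values_fill = 0: a review with no negative (or no positive) words would otherwise get NA, and the subtraction would return NA as well.

```r
library(dplyr)
library(tidyr)

# Made-up matched tokens: r1 has 2 positive and 1 negative word,
# r2 has 2 negative words, r3 has 1 positive word
toy_matched <- tibble(
  id        = c("r1", "r1", "r1", "r2", "r2", "r3"),
  sentiment = c("positive", "positive", "negative",
                "negative", "negative", "positive")
)

toy_scores <- toy_matched %>%
  # one row per (id, sentiment) pair with its frequency n
  count(id, sentiment) %>%
  # one row per id, one column per sentiment; missing combinations become 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # net score: positive words minus negative words
  mutate(sentiment_score = positive - negative)

toy_scores
```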

5. Plot these sentiment scores across the ids.

Hint: Map id onto the x-axis and sentiment_score onto the y-axis.

review_sentiment %>% 
  mutate(color = ifelse(sentiment_score > 0, "pos", "neg")) %>%
  ggplot(aes(x = reorder(id, -sentiment_score), y = sentiment_score, fill = color)) +
  geom_col() + labs(x = "id") +
  theme_classic() + theme(axis.text.x=element_blank()) 

6. Create a confusion matrix, which is an insightful summary of the correct and incorrect classifications.

Hint: You can use the table function.

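# Dichotomize the net sentiment score so it can be compared with the original binary labels
review_sentiment <- review_sentiment %>% 
  # 0: net score below 5, 1: net score of 5 or higher
  # (this cutoff applies to the net word count, not to the IMDB rating;
  #  a cutoff of 0, i.e. more positive than negative words, is a natural alternative)
  mutate(dicho_sentiment = ifelse(sentiment_score < 5, 0, 1))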

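# 5 reviews contain no words from the Bing lexicon, so the inner join dropped them;
# keep only the reviews that still have a sentiment score
movie_review <- movie_review %>% 
  filter(id %in% review_sentiment$id)

# Caution: table() pairs the two vectors row by row, so both tables must be in the
# same id order. count() returns rows ordered by id, which need not match the
# order of movie_review; joining on id first is the safer way to line them up.
# The agreement below is close to chance (about 51% of reviews on the diagonal).
table(true = movie_review$sentiment, predicted = review_sentiment$dicho_sentiment)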
    predicted
true    0    1
   0 1830  651
   1 1811  703
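
The sketch below does two things. First, it computes the accuracy implied by the confusion matrix printed above. Second, it shows on made-up data (toy_labels and toy_scores are hypothetical) how joining on id guarantees that each label is compared with the score of the same review, regardless of row order. The cutoff of 0 used in the toy part is an assumption for illustration and differs from the cutoff of 5 used above.

```r
library(dplyr)

# Confusion matrix copied from the output above (rows = true, columns = predicted)
conf_mat <- matrix(c(1830,  651,
                     1811,  703),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(true = c("0", "1"), predicted = c("0", "1")))

# accuracy = correctly classified reviews / all reviews = (1830 + 703) / 4995
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy

# Made-up data: the labels and the scores are stored in different row orders
toy_labels <- tibble(id = c("r3", "r1", "r2"), sentiment = c(1L, 1L, 0L))
toy_scores <- tibble(id = c("r1", "r2", "r3"), sentiment_score = c(3, -2, 1))

toy_eval <- toy_labels %>%
  # match each label to the score of the same review
  inner_join(toy_scores, by = "id") %>%
  # assumed cutoff: more positive than negative words -> predicted positive
  mutate(predicted = as.integer(sentiment_score > 0))

table(true = toy_eval$sentiment, predicted = toy_eval$predicted)
```

Because both columns now live in the same table, the comparison no longer depends on how either input happened to be sorted.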

TF-IDF based